A Comparative Study of Hidden Web Crawlers

Authors

  • Sonali Gupta
  • Komal Kumar Bhatia
Abstract

A large amount of data on the WWW remains inaccessible to the crawlers of Web search engines because it can only be exposed on demand, as users fill out and submit forms. The Hidden Web refers to the collection of Web data that a crawler can access only through interaction with a Web-based search form, and not simply by traversing hyperlinks. Research on the Hidden Web emerged almost a decade ago, with the main line of work exploring ways to access the content of online databases that are usually hidden behind search forms. Efforts in the area concentrate on designing Hidden Web crawlers that learn forms and fill them with meaningful values. This paper gives an insight into the various Hidden Web crawlers developed for the purpose, noting the advantages and shortcomings of the techniques employed in each.

Keywords— WWW, Surface Web, Hidden Web, Deep Web, Crawler, search form, Surfacing, Virtual Integration.
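
To make the contrast with link-following crawlers concrete, the following is a minimal sketch, assuming the requests and BeautifulSoup libraries, of the basic surfacing loop the abstract describes: locate a search form on a page, fill its text fields with a candidate value, and submit it. The entry URL and the seed query term are hypothetical placeholders, not taken from the paper.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_search_form(page_url):
    # Return the first <form> on the page, or None if the page offers
    # no entry point into a hidden database.
    html = requests.get(page_url, timeout=10).text
    return BeautifulSoup(html, "html.parser").find("form")

def submit_form(form, base_url, query_value):
    # Fill every text input with a candidate value and submit the form,
    # mimicking the user interaction a link-following crawler cannot perform.
    action = urljoin(base_url, form.get("action") or "")
    method = (form.get("method") or "get").lower()
    data = {}
    for inp in form.find_all("input"):
        name = inp.get("name")
        if not name:
            continue
        if inp.get("type", "text") == "text":
            data[name] = query_value          # candidate value for the text field
        else:
            data[name] = inp.get("value", "") # keep hidden/default fields as-is
    if method == "post":
        return requests.post(action, data=data, timeout=10)
    return requests.get(action, params=data, timeout=10)

base = "https://example.com/catalog"   # hypothetical entry page
form = find_search_form(base)
if form is not None:
    response = submit_form(form, base, "databases")   # hypothetical seed query term
    print(len(response.text), "bytes surfaced from the hidden database")

The crawlers surveyed in the paper differ precisely on the questions this sketch glosses over: which forms are worth submitting, and where the meaningful candidate values come from.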


Similar Articles

Improving the performance of focused web crawlers

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...
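
As an illustration of the relevance estimation such focused crawlers perform, here is a minimal sketch of bag-of-words cosine scoring between a topic description and a fetched page. It is illustrative only, not the scoring method of the cited work; the topic string and page text are hypothetical.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Cosine similarity between two term-frequency vectors.
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

topic = "hidden web crawler search form database"
page_text = "crawling the deep web requires submitting search forms to databases"
score = cosine_similarity(topic, page_text)
print(f"relevance = {score:.3f}")  # follow the page's out-links only above a threshold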

A Comparative Study of Ranking Techniques for Hidden Web and Surface Web

The web consists of the Surface Web and the Hidden Web. The Surface Web is also known as the publicly indexable web; it can be accessed by search engines using hyperlinks present on pages and simple keyword-matching schemes. The Hidden Web refers to content that is hidden behind HTML forms and contains a large collection of data that is unreachable by link-based search engines. A study conducted at U...

A scale for crawler effectiveness on the client-side hidden web

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. First, we perform a thorough analysis of the different client-side technologies and the main features of the web pages in order to determine the basic steps of the aforementioned scale. Then, we define the scale by grouping basic scenario...

Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we accomplish several tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of the Web 2.0 pages in order to determine the initial levels of the aforementioned scale. Se...

An Improved Extraction Algorithm from Domain Specific Hidden Web

The web contains a large amount of information that is increasing in magnitude every day. The World Wide Web consists of the Surface Web (the publicly indexed web) and the Deep Web, which holds hidden data and is also referred to by other names such as the Hidden Web, Deepnet or the Invisible Web. A user can directly access the surface web through a Search Engine, but to access the hidden data/informatio...


Journal:
  • CoRR

Volume abs/1407.5732  Issue

Pages  -

Publication date 2014